Rewrite FileStream in terms of Morsel API #21342
| /// This groups together ready planners, ready morsels, the active reader, | ||
| /// pending planner I/O, the remaining files and limit, and the metrics | ||
| /// associated with processing that work. | ||
| pub(super) struct ScanState { |
This is the new inner state machine for FileStream.

I think some more diagrams in the docstring of the struct and/or its fields could help. I'm trying to wrap my head around how the IO queue and the related pieces work.
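As a rough aid for the question above, here is a minimal toy model of how such a per-partition state machine could move work between its queues. It is only a sketch: `ToyScanState`, `Step`, the `u32` "batches", and the priority order inside `step` are all assumptions made for illustration, not the actual `ScanState` implementation. Only the list of things being tracked (remaining files, pending planner I/O, ready planners, ready morsels, active reader) comes from the struct's docstring.

```rust
use std::collections::VecDeque;

/// What one call to `step` produced.
#[derive(Debug, PartialEq)]
enum Step {
    /// One batch from the active reader (a `u32` stands in for a RecordBatch).
    Batch(u32),
    /// Work moved between queues; call `step` again.
    Progress,
    /// No files, I/O, planners, morsels, or reader remain.
    Done,
}

/// Toy model of the per-partition state; every name here is hypothetical.
struct ToyScanState {
    files: VecDeque<&'static str>,          // remaining files
    pending_io: VecDeque<&'static str>,     // planners waiting on I/O
    ready_planners: VecDeque<&'static str>, // planners with CPU work to do
    ready_morsels: VecDeque<Vec<u32>>,      // morsels ready to be read
    reader: Option<std::vec::IntoIter<u32>>, // the active reader, if any
}

impl ToyScanState {
    fn new(files: &[&'static str]) -> Self {
        Self {
            files: files.iter().copied().collect(),
            pending_io: VecDeque::new(),
            ready_planners: VecDeque::new(),
            ready_morsels: VecDeque::new(),
            reader: None,
        }
    }

    /// Assumed priority: output first, then morsels, planners, I/O, new files.
    fn step(&mut self) -> Step {
        // 1. Drain the active reader before starting anything else.
        if let Some(reader) = self.reader.as_mut() {
            if let Some(batch) = reader.next() {
                return Step::Batch(batch);
            }
            self.reader = None;
            return Step::Progress;
        }
        // 2. Turn a ready morsel into the active reader.
        if let Some(morsel) = self.ready_morsels.pop_front() {
            self.reader = Some(morsel.into_iter());
            return Step::Progress;
        }
        // 3. Let a ready planner produce a morsel (toy: always two batches).
        if let Some(_planner) = self.ready_planners.pop_front() {
            self.ready_morsels.push_back(vec![1, 2]);
            return Step::Progress;
        }
        // 4. Pretend pending I/O completes immediately.
        if let Some(planner) = self.pending_io.pop_front() {
            self.ready_planners.push_back(planner);
            return Step::Progress;
        }
        // 5. Start I/O for the next file.
        if let Some(file) = self.files.pop_front() {
            self.pending_io.push_back(file);
            return Step::Progress;
        }
        Step::Done
    }
}

/// Drive the toy state machine to completion and collect its output.
fn drain(state: &mut ToyScanState) -> Vec<u32> {
    let mut out = Vec::new();
    loop {
        match state.step() {
            Step::Batch(batch) => out.push(batch),
            Step::Progress => {}
            Step::Done => return out,
        }
    }
}
```

In this toy version each file flows files → pending I/O → ready planners → ready morsels → reader, one hop per `step` call; a real implementation would poll futures rather than complete I/O instantly.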
| use std::sync::Arc; | ||
| use std::sync::mpsc::{self, Receiver, TryRecvError}; | ||
|
|
||
| /// Adapt a legacy [`FileOpener`] to the morsel API. |
This is an adapter so that existing FileOpeners continue to have the same behavior
| @@ -0,0 +1,556 @@ | |||
| // Licensed to the Apache Software Foundation (ASF) under one | |||
This is testing infrastructure to write the snapshot tests
| return Poll::Ready(Some(Err(err))); | ||
| } | ||
| } | ||
| FileStreamState::Scan { scan_state: queue } => { |
Moved the inner state machine into a separate module/struct to try to keep indenting under control and encapsulate the complexity somewhat.
| assert!(err.contains("FileStreamBuilder invalid partition index: 1")); | ||
| } | ||
|
|
||
| /// Verifies the simplest morsel-driven flow: one planner produces one |
Here are tests showing the sequence of calls to the various morsel APIs. I intend to use this framework to show how work can migrate from one stream to the other
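To illustrate the idea of testing a calling sequence, here is a minimal sketch of a recorder that logs each call and is then compared against an expected snapshot. `CallLog`, `MockPlanner`, `MockMorsel`, `scan_one_file`, and the event strings are all hypothetical names invented for this example; the PR's actual test framework and the real morsel API signatures may look quite different.

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// Shared, append-only log of the calls made during a scan (hypothetical).
#[derive(Clone, Default)]
struct CallLog(Rc<RefCell<Vec<String>>>);

impl CallLog {
    fn record(&self, event: &str) {
        self.0.borrow_mut().push(event.to_string());
    }

    /// Render the log as one string, suitable for a snapshot comparison.
    fn snapshot(&self) -> String {
        self.0.borrow().join("\n")
    }
}

/// Mock planner that records when it is asked to plan.
struct MockPlanner {
    log: CallLog,
    batches: Vec<u32>,
}

impl MockPlanner {
    fn plan(self) -> MockMorsel {
        self.log.record("planner: plan");
        MockMorsel { log: self.log, batches: self.batches }
    }
}

/// Mock morsel that records when it is turned into output batches.
struct MockMorsel {
    log: CallLog,
    batches: Vec<u32>,
}

impl MockMorsel {
    fn into_batches(self) -> Vec<u32> {
        self.log.record("morsel: into_batches");
        self.batches
    }
}

/// Toy driver: one file, one planner, one morsel, one batch.
fn scan_one_file(log: &CallLog, file: &str) -> Vec<u32> {
    log.record(&format!("morselize: {file}"));
    let planner = MockPlanner { log: log.clone(), batches: vec![42] };
    let batches = planner.plan().into_batches();
    for batch in &batches {
        log.record(&format!("output: batch {batch}"));
    }
    log.record("output: done");
    batches
}
```

A test then asserts on `log.snapshot()`, so any change in the order of calls shows up as a readable diff in the expected text.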
| all-features = true | ||
|
|
||
| [features] | ||
| backtrace = ["datafusion-common/backtrace"] |
I added this while debugging why the tests failed on CI but not locally: when this feature flag was on, the error messages got mangled.
I added a crate-level feature that enables the feature in datafusion-common so I could reproduce the error locally.
adriangb left a comment
Ran out of time for the last couple of files. A lot of the comments are just tracking my thought process. I plan to go over them again to clarify my own understanding, but maybe they're helpful as input on how the code reads top to bottom for a first-time reader.
| /// Creates a `dyn Morselizer` based on given parameters. | ||
| /// | ||
| /// The default implementation preserves existing behavior by adapting the | ||
| /// legacy [`FileOpener`] API into a [`Morselizer`]. | ||
| /// | ||
| /// It is preferred to implement the [`Morselizer`] API directly by | ||
| /// implementing this method. | ||
| fn create_morselizer( | ||
| &self, | ||
| object_store: Arc<dyn ObjectStore>, | ||
| base_config: &FileScanConfig, | ||
| partition: usize, | ||
| ) -> Result<Box<dyn Morselizer>> { | ||
| let opener = self.create_file_opener(object_store, base_config, partition)?; | ||
| Ok(Box::new(FileOpenerMorselizer::new(opener))) | ||
| } |
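The default-method pattern in the snippet above can be sketched in isolation as follows. This is only an illustration under simplifying assumptions: `ToyOpener`, `ToyMorselizer`, `ToySource`, and `OpenerAdapter` are hypothetical stand-ins, their signatures are much simpler than the real `FileOpener` / `Morselizer` traits, and treating one opened file as exactly one morsel is an assumption about the adapter, not something the diff confirms.

```rust
/// Hypothetical, simplified stand-in for the legacy opener API.
trait ToyOpener {
    fn open(&self, path: &str) -> Vec<u32>;
}

/// Hypothetical, simplified stand-in for the morsel-producing API.
trait ToyMorselizer {
    fn morselize(&self, path: &str) -> Vec<Vec<u32>>;
}

/// Adapter: wraps a legacy opener and exposes it through the new API.
struct OpenerAdapter {
    opener: Box<dyn ToyOpener>,
}

impl ToyMorselizer for OpenerAdapter {
    fn morselize(&self, path: &str) -> Vec<Vec<u32>> {
        // Assumption: the whole file becomes a single morsel.
        vec![self.opener.open(path)]
    }
}

/// A source only has to implement `create_opener`; the default
/// `create_morselizer` adapts it, so existing sources keep working.
trait ToySource {
    fn create_opener(&self) -> Box<dyn ToyOpener>;

    fn create_morselizer(&self) -> Box<dyn ToyMorselizer> {
        Box::new(OpenerAdapter { opener: self.create_opener() })
    }
}

/// A legacy source that knows nothing about morsels.
struct LegacyOpener;

impl ToyOpener for LegacyOpener {
    fn open(&self, path: &str) -> Vec<u32> {
        // Toy "data": the length of the path.
        vec![path.len() as u32]
    }
}

struct LegacySource;

impl ToySource for LegacySource {
    fn create_opener(&self) -> Box<dyn ToyOpener> {
        Box::new(LegacyOpener)
    }
}
```

A source that wants finer-grained morsels would override `create_morselizer` directly instead of relying on the default.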
| /// Configure the [`FileOpener`] used to open files. | ||
| /// | ||
| /// This will overwrite any setting from [`Self::with_morselizer`] | ||
| pub fn with_file_opener(mut self, file_opener: Arc<dyn FileOpener>) -> Self { |
While I think it could make sense to keep FileOpener as a public API for building data sources (if we consider it simpler, for folks who don't care about perf), this method in particular seems like a mostly internal method (even if it is pub) that we might as well deprecate / remove.

This method is the way we could keep using FileOpener (as it is simpler).
I am not sure how we could still allow using FileOpener but not keep this method.
| /// The active reader, if any. | ||
| reader: Option<BoxStream<'static, Result<RecordBatch>>>, |
Is there one ScanState across all partitions or one per partition? I'm guessing the latter: file_iter: VecDeque<PartitionedFile> is the files for this partition, and we pump all of the files into one output stream of RecordBatch (reader). But we can have multiple planners / morsels ready and merge those all into a single stream of RecordBatch on the way out.

One per partition.

> we pump all of the files into one output stream of RecordBatch (reader). But we can have multiple planners / morsels ready and merge those all into a single stream of RecordBatch on the way out.

My initial proposal (following @Dandandan's original design) is that, when possible, the files are put into a shared queue so that when a FileStream is ready it gets the next file.
I think once we get that structure in place, we can contemplate more sophisticated designs (like one FileStream preparing a Parquet file and then divvying up the record batches between other cores).

> Is there one ScanState across all partitions or one per partition? I'm guessing the latter: file_iter: VecDeque is the files for this partition

Yes, it is one ScanState per partition.

> we pump all of the files into one output stream of RecordBatch (reader). But we can have multiple planners / morsels ready and merge those all into a single stream of RecordBatch on the way out.

Yes, this is right.
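The shared-queue proposal from this thread could look roughly like the sketch below, where each partition pulls its next file from a common queue as soon as it is free. This is a hedged illustration using threads and `std` only: `SharedFiles`, `run_partition`, and `scan_with_shared_queue` are hypothetical names, and the real design would work with async streams and `PartitionedFile` values rather than threads and strings.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};
use std::thread;

/// Files not yet claimed by any partition (hypothetical type alias).
type SharedFiles = Arc<Mutex<VecDeque<String>>>;

/// One "FileStream": keep claiming the next file until the queue is empty.
fn run_partition(queue: SharedFiles) -> Vec<String> {
    let mut scanned = Vec::new();
    loop {
        // The lock guard is a temporary, so it is released at the end of
        // this statement and never held while "scanning".
        let next = queue.lock().unwrap().pop_front();
        match next {
            Some(file) => scanned.push(file),
            None => break,
        }
    }
    scanned
}

/// Run `partitions` workers over one shared queue and report what each scanned.
fn scan_with_shared_queue(files: Vec<String>, partitions: usize) -> Vec<Vec<String>> {
    let queue: SharedFiles = Arc::new(Mutex::new(files.into()));
    let handles: Vec<_> = (0..partitions)
        .map(|_| {
            let queue = Arc::clone(&queue);
            thread::spawn(move || run_partition(queue))
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}
```

Which partition ends up with which file is not deterministic here; the only guarantee is that every file is claimed exactly once, which is the property that lets a fast partition pick up work a slow one has not started.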
Ok the first PR in the chain is ready for review (that is basically 50% of this PR)
…er` (#21327)

~(Draft until I am sure I can use this API to make FileStream behave better)~

## Which issue does this PR close?

- part of #20529
- Needed for #21351
- Broken out of #20820
- Closes #21427

## Rationale for this change

I can get 10% faster on many ClickBench queries by reordering files at runtime. You can see it all working together here: #21351

To do so, I need to rework the FileStream so that it can reorder operations at runtime. Eventually that will include both CPU and IO.

This PR is a step in that direction by introducing the main Morsel API and implementing it for Parquet. The next PR (#21342) rewrites FileStream in terms of the Morsel API.

## What changes are included in this PR?

1. Add proposed `Morsel` API
2. Rewrite Parquet opener in terms of that API
3. Add an adapter layer (back to FileOpener, so I don't have to rewrite FileStream in the same PR)

My next PR will rewrite the FileStream to use the Morsel API.

## Are these changes tested?

Yes, by existing CI. I will work on adding additional tests for just the Parquet opener in a follow-on PR.

## Are there any user-facing changes?

No
|
|
||
| /// A stream that iterates record batch by record batch, file over file. | ||
| pub struct FileStream { | ||
| /// An iterator over input files. |
The state machine responsible for opening files and interacting with the Morsels is now in ScanState.
| } | ||
| } | ||
|
|
||
| /// Adapter for a [`MorselPlanner`] to the [`FileOpener`] API |
shim layer removed as the FileStream uses the Morsel API natively now
| /// Verifies that a planner can traverse two sequential I/O phases before | ||
| /// producing one batch (similar to Parquet, which does this). | ||
| #[tokio::test] | ||
| async fn morsel_two_ios_one_batch() -> Result<()> { |
Also it may not be multiple actual IOs, but rather that the IO future isn't ready for a few polls
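The point about an I/O future simply not being ready for a few polls can be shown with a small hand-written future. This is a sketch using only `std`: `SlowIo`, `NoopWake`, and `polls_until_ready` are hypothetical names for illustration and are not part of the PR's test framework.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// A single logical I/O that reports `Pending` a fixed number of times
/// before completing, as a slow object store request might.
struct SlowIo {
    remaining_pending: usize,
}

impl Future for SlowIo {
    type Output = &'static str;

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        if self.remaining_pending == 0 {
            Poll::Ready("metadata")
        } else {
            self.remaining_pending -= 1;
            // Ask to be polled again; a real I/O future would register the
            // waker with the reactor instead.
            cx.waker().wake_by_ref();
            Poll::Pending
        }
    }
}

/// Waker that does nothing, enough for polling by hand in a loop.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Poll `SlowIo` by hand and count how many polls it takes to finish.
fn polls_until_ready(pending: usize) -> usize {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(SlowIo { remaining_pending: pending });
    let mut polls = 0;
    loop {
        polls += 1;
        if fut.as_mut().poll(&mut cx).is_ready() {
            return polls;
        }
    }
}
```

From the caller's side, one I/O that is pending for several polls and several back-to-back I/Os look the same: the stream sees `Pending` repeatedly and has to come back later, which is why a test can cover both cases.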
🤖 Benchmark completed (GKE): tpcds, base (merge-base) vs. branch
🤖 Benchmark completed (GKE): clickbench_partitioned, base (merge-base) vs. branch
This PR is now ready for review. It shows no performance difference (as expected) but is one step closer to #21351, which does.
Stacked on ParquetOpener to ParquetMorselizer #21327
Which issue does this PR close?
Rationale for this change
The Morsel API allows for finer-grained parallelism (and IO). It is important for FileStream to work in terms of the Morsel API so that future features (like work stealing) become possible.
What changes are included in this PR?
I apologize for the large diff. Note that about half of this PR is tests and a test framework for checking the calling sequence of FileStream.
Are these changes tested?
Yes, by existing functional and benchmark tests, as well as new snapshot-based functional tests.
Are there any user-facing changes?
No (not yet)